class: center, middle, inverse, title-slide

# APSTA-GE 2003: Intermediate Quantitative Methods
## Lab Section 003, Week 6
### New York University
### 10/13/2020

---

## Reminders

- **Group Assignment**
  - Due: **10/14/2020 11:59pm (EST)**
- **Assignment 3**
  - Due: **10/19/2020 11:55pm (EST)**
- Office hours
  - Monday 9 - 10am (EST)
  - Wednesday 12:30 - 1:30pm (EST)
  - By appointment
- Office hour Zoom link
  - https://nyu.zoom.us/j/97347070628 (pin: 2003)
- Office hour notes
  - Available on NYU Classes
- Updates on Lab Slides
  - Available on NYU Classes

---

## Today's Topics

- Review Diagnostic Regression Analysis
  - How to check **linearity**?
    - Residuals vs. Fitted Values
  - How to check the **normality** of residuals?
    - Q-Q plot
  - How to check **equal variance**?
    - Residuals vs. Fitted Values
  - How to find outliers?
    - Outliers
    - Leverage Points
    - Influential Points
- Regression Diagnosis
  - Using a new dataset: `marketing`

---
class: inverse, center, middle

# Diagnostic Regression Analysis

---

## Linear Regression

### **Linear + Regression**

### An approach to:

1. Make predictions by fitting a regression model.
2. Quantify how much of the variance in the dependent variable is explained by the independent variable(s).

**Simple linear regression:** only one independent variable

**Multiple linear regression:** two or more independent variables

---

## Model Types

- **Linear equation:** `\(Y_i = \beta_1 + \beta_2 \times X_i\)`
- **Log equation:** `\(Y_i = \beta_1 + \beta_2 \times log(X_i)\)`
- **Quadratic equation:** `\(Y_i = \beta_1 + \beta_2 \times X_i^2\)`
- **Cubic equation:** `\(Y_i = \beta_1 + \beta_2 \times X_i^3\)`
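---

## Fitting Each Model Type in R (sketch)

Before visualizing them, it may help to see how each model type could be fit with `lm()`. A minimal sketch using a hypothetical data frame `dat` (not one of the lab datasets): note that `^` is a formula operator in R, so polynomial terms must be wrapped in `I()`.

```r
set.seed(1)
# Hypothetical data, only to illustrate the formula syntax
dat <- data.frame(X = runif(100, min = 1, max = 5))
dat$Y <- 20 + 3 * dat$X + rnorm(100)

mod_linear <- lm(Y ~ X, data = dat)       # Linear: Y = b1 + b2 * X
mod_log    <- lm(Y ~ log(X), data = dat)  # Log: Y = b1 + b2 * log(X)
mod_quad   <- lm(Y ~ I(X^2), data = dat)  # Quadratic: I() protects ^
mod_cubic  <- lm(Y ~ I(X^3), data = dat)  # Cubic
```

Without `I()`, the formula `Y ~ X^2` silently expands to `Y ~ X` and fits the plain linear model.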
---

## Model Types Visualization

- <span style="color: red;">Linear model</span>: `\(Y_i = 20 + 3 \times X_i\)`
- <span style="color: green;">Log model</span>: `\(Y_i = 20 + 3 \times log(X_i)\)`
- <span style="color: gray;">Quadratic model</span>: `\(Y_i = 20 + 3 \times X_i^2\)`
- <span style="color: blue;">Cubic model</span>: `\(Y_i = 20 + 3 \times X_i^3\)`
- `\(\varepsilon \sim \mathcal{N}(0, 1)\)`
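---

## Drawing the Four Curves (sketch)

A sketch of how the four curves above could be drawn in base R with `curve()`. The colors and the intercept/slope values (20 and 3) follow the list on the previous slide, while the axis ranges are arbitrary choices for display.

```r
curve(20 + 3 * x, from = 0.5, to = 4, col = "red",
      ylim = c(0, 220), xlab = "X", ylab = "Y")     # Linear
curve(20 + 3 * log(x), add = TRUE, col = "green")   # Log
curve(20 + 3 * x^2,    add = TRUE, col = "gray")    # Quadratic
curve(20 + 3 * x^3,    add = TRUE, col = "blue")    # Cubic
```

Outside of a formula, `^` is plain exponentiation, so no `I()` wrapper is needed here.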
---

## How to check linearity?

What type of model should we use? (☉∀☉)

```r
n <- 50                          # Sample size
X <- rnorm(n, mean = 3, sd = 1)  # I.V.
dat_tricky <- data.frame(
  X,
* Y = 10 + 2 * X + 15 * log(X) - 5 * X^2 + X^3 + rnorm(n, mean = 0, sd = 20)
)
```
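---

## Comparing Candidate Models (sketch)

Besides eyeballing the residual plots on the following slides, one way to compare candidate forms is to fit all of them and check a fit statistic such as AIC. A sketch, re-creating `dat_tricky` with a hypothetical seed so the chunk is self-contained:

```r
set.seed(2003)  # Hypothetical seed; re-creates dat_tricky from the previous slide
n <- 50
X <- rnorm(n, mean = 3, sd = 1)
dat_tricky <- data.frame(
  X,
  Y = 10 + 2 * X + 15 * log(X) - 5 * X^2 + X^3 + rnorm(n, mean = 0, sd = 20)
)

fits <- list(
  linear    = lm(Y ~ X, data = dat_tricky),
  log       = lm(Y ~ log(X), data = dat_tricky),
  quadratic = lm(Y ~ I(X^2), data = dat_tricky),
  cubic     = lm(Y ~ I(X^3), data = dat_tricky)
)
sort(sapply(fits, AIC))  # Lower AIC = better fit/complexity trade-off
```

AIC only ranks the candidates; it is not a substitute for the diagnostic plots.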
---

## To Check Linearity...

Let's fit a linear regression model using the correct dataset first!

```r
str(dat_linear)
```

```
## 'data.frame':	50 obs. of  2 variables:
##  $ X: num  2.91 4.32 3.64 4.17 3.12 ...
##  $ Y: num  28 32.4 30.6 31.2 29.9 ...
```

```r
# Fit a linear regression model
*mod_lin <- lm(Y ~ X, data = dat_linear)
*summary(mod_lin)
```

```
## 
## Call:
## lm(formula = Y ~ X, data = dat_linear)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7693 -0.7461 -0.2736  0.5419  2.6855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.6092     0.4265   48.32   <2e-16 ***
## X             2.8983     0.1252   23.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.035 on 48 degrees of freedom
## Multiple R-squared:  0.9178,	Adjusted R-squared:  0.916 
## F-statistic: 535.6 on 1 and 48 DF,  p-value: < 2.2e-16
```

---

## After Fitting the Model...

We can check model linearity by visualizing the distribution of the residuals.

```r
par(mfrow = c(1, 2))
*plot(Y ~ X, data = dat_linear)
text(Y ~ X, data = dat_linear, labels = rownames(dat_linear),
     cex = 0.6, pos = 3)
abline(mod_lin, col = "red")  # Fitted line: intercept 20.6092, slope 2.8983
*plot(mod_lin, which = 1)
```

<!--  -->

---

## Res. vs. Fitted for Log Model

```r
*mod_log <- lm(Y ~ log(X), data = dat_log)
par(mfrow = c(1, 2))
*plot(Y ~ log(X), data = dat_log)
text(Y ~ log(X), data = dat_log, labels = rownames(dat_log),
     cex = 0.6, pos = 3)
*plot(mod_log, which = 1)
```

<!--  -->

---

## Res. vs. Fitted for Quadratic Model

```r
# Wrap X^2 in I() so ^ is treated as arithmetic, not a formula operator
*mod_quad <- lm(Y ~ I(X^2), data = dat_quad)
par(mfrow = c(1, 2))
*plot(Y ~ I(X^2), data = dat_quad)
text(Y ~ I(X^2), data = dat_quad, labels = rownames(dat_quad),
     cex = 0.6, pos = 3)
*plot(mod_quad, which = 1)
```

<!--  -->

---

## Res. vs. Fitted for Cubic Model

```r
*mod_cubic <- lm(Y ~ I(X^3), data = dat_cubic)
par(mfrow = c(1, 2))
*plot(Y ~ I(X^3), data = dat_cubic)
text(Y ~ I(X^3), data = dat_cubic, labels = rownames(dat_cubic),
     cex = 0.6, pos = 3)
*plot(mod_cubic, which = 1)
```

<!--  -->

---

## Residuals vs. Fitted for Different Models

```r
par(mfrow = c(2, 2))
*plot(lm(Y ~ X, data = dat_linear), which = 1)
*plot(lm(Y ~ log(X), data = dat_log), which = 1)
*plot(lm(Y ~ I(X^2), data = dat_quad), which = 1)
*plot(lm(Y ~ I(X^3), data = dat_cubic), which = 1)
```

<!--  -->

---

## How to check the normality of residuals?

Given a linear regression model, we can check the **normality** of residuals using a **Q-Q plot**.

```r
par(mfrow = c(1, 2))
*plot(mod_lin, which = 2, pch = 16)
# A perfectly linear dataset for comparison
X_perfect <- 1:50
dat_perfect <- data.frame(X = X_perfect, Y = X_perfect)
plot(lm(Y ~ X, data = dat_perfect), which = 2, pch = 16, col = "red")
```

<!--  -->

---

## Q-Q Plot Also Applies to Other Model Types

However, normality of the residuals may not hold for some of these model types.

```r
par(mfrow = c(1, 3))
*plot(mod_log, which = 2, pch = 16)
*plot(mod_quad, which = 2, pch = 16)
*plot(mod_cubic, which = 2, pch = 16)
```

<!--  -->

---

## How to check equal variance?

To check equal variance, we can again use the **Residuals vs. Fitted** plot: a funnel (fan) shape indicates unequal variance.

<!--  -->

---

## How to find outliers?

To find outliers, we can calculate **standardized residuals** and plot them against the fitted values. Points beyond the ±2 lines are flagged as potential outliers.

```r
*library(MASS)
*dat_linear$yhat <- predict(mod_lin)    # Fitted values
*dat_linear$sd_res <- stdres(mod_lin)   # Standardized residuals
*plot(sd_res ~ yhat, data = dat_linear, ylim = c(-3, 3))
text(sd_res ~ yhat, data = dat_linear, labels = rownames(dat_linear),
     cex = 0.8, pos = 3)
*abline(h = 2, col = "red")
*abline(h = -2, col = "red")
```

<!--  -->

---

## Leverage Points

Magnets. Points with extreme predictor (X) values act like magnets: they attract the regression line vertically, pushing it down or pulling it up toward themselves.

<!--  -->

---

## Influential Points

Points that pull the regression line in both directions: they are both outlying (extreme in Y) and high-leverage (extreme in X), so removing them noticeably changes the fitted line.
<!--  -->

---

## Detecting Influential Points

```r
*library(car)
influencePlot(mod_lin)
```

<!--  -->

```
##       StudRes        Hat      CookD
## 6  -0.6110870 0.16342545 0.03695707
## 31 -0.5627195 0.10545620 0.01893440
## 47  2.1316996 0.04867638 0.10826146
## 50  2.8018499 0.02002342 0.07018497
```

---
class: inverse, center, middle

# Regression Diagnosis

---

## New Dataset for Practice

Dataset: [marketing.csv](https://drive.google.com/file/d/1RO3M0__Tnb_MInU9Modj8TIIuZqZG_-p/view?usp=sharing)

```r
dat <- datarium::marketing
```

**Let's move to RStudio.**

---

## Contact

Tong Jin

- Email: tj1061@nyu.edu
- Office Hours
  - Mondays, 9 - 10am (EST)
  - Wednesdays, 12:30 - 1:30pm (EST)
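---

## Appendix: Starter Diagnostics for `marketing` (sketch)

A hedged starter sketch for the RStudio practice, assuming the `datarium` package is installed. `marketing` contains advertising budgets (`youtube`, `facebook`, `newspaper`) and `sales`; the simple regression below is only an illustrative choice, not the lab's required model.

```r
dat <- datarium::marketing
mod <- lm(sales ~ youtube, data = dat)  # Illustrative simple regression

par(mfrow = c(2, 2))
plot(mod, which = 1)  # Residuals vs. Fitted: linearity / equal variance
plot(mod, which = 2)  # Q-Q plot: normality of residuals
plot(mod, which = 4)  # Cook's distance: influential points
plot(mod, which = 5)  # Residuals vs. Leverage
```

The same four plots can be produced for any `lm` fit, so this sketch carries over to the multiple-regression models in the lab.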